GH-41834: [R] Better error handling in dplyr code #41576
…apping; make arrow_eval error
178648a to 0cd2ff3
abort("`...` argument to `across()` is deprecated in dplyr and not supported in Arrow")
arrow_not_supported(
  "`...` argument to `across()` is deprecated in dplyr and",
  body = c(">" = "Convert your call into a function or formula including the arguments"),
TIL about this making arrows!
Thanks for this! I'm really excited to see more helpful warnings.
I went through some of the code pretty thoroughly, but mostly skimmed the `dplyr-{verb}.R` files since those are all(?) indentation changes, yeah?
names(sorts)[i] <- format_expr(exprs[[i]])
if (inherits(sorts[[i]], "try-error")) {
  msg <- paste("Expression", names(sorts)[i], "not supported in Arrow")
  return(abandon_ship(call, .data, msg))
Here's an example of "not just an indentation change": in the new code, we don't have to evaluate, catch the error, and re-raise in `abandon_ship`, we just let `arrow_eval()` raise, and `try_arrow_dplyr()` handles the `abandon_ship`.
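To illustrate the shape of that change, here is a minimal base-R sketch. Every name in it (`not_supported`, `eval_or_raise`, `with_fallback`, `sort_keys`) is a hypothetical stand-in rather than the arrow package's actual code; the point is only that the evaluation helper raises a classed error and a single wrapper decides whether to fall back.

```r
# Hypothetical stand-ins; these are not the arrow package's internals.

# Signal an error that carries an extra class so handlers can tell it apart.
not_supported <- function(msg) {
  stop(structure(
    class = c("arrow_not_supported", "error", "condition"),
    list(message = msg, call = NULL)
  ))
}

# The evaluation helper simply lets its errors propagate.
eval_or_raise <- function(x) {
  if (!is.numeric(x)) not_supported("Non-numeric input not supported in Arrow")
  x * 2
}

# One wrapper handles the fallback, and only for the "not supported" class.
# Any other error passes through untouched.
with_fallback <- function(expr, fallback) {
  tryCatch(expr, arrow_not_supported = function(e) fallback(e))
}

# A "verb" no longer needs try() plus an inherits(..., "try-error") check.
sort_keys <- function(x) {
  with_fallback(
    eval_or_raise(x),
    fallback = function(e) paste("falling back:", conditionMessage(e))
  )
}

sort_keys(21)   # 42
sort_keys("a")  # "falling back: Non-numeric input not supported in Arrow"
```

Because the wrapper only names the `arrow_not_supported` class, an unrelated error such as `stop("boom")` inside the wrapped expression is not swallowed.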
!is.null(results[[new_var]])) {
  # We need some wrapping to handle literal values
  if (length(results[[new_var]]) != 1) {
    arrow_not_supported("Recycling values of length != 1", call = exprs[[i]])
Here's another not-just-indentation change: for validations/errors outside of `arrow_eval`, we just raise `arrow_not_supported` or `validation_error` like in the function bindings.
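As a rough sketch of what raising such classed errors can look like, the following uses only base R's `errorCondition()`. The helper bodies are assumptions: the diff shows `arrow_not_supported()` taking `body` and `call` arguments, but the real implementation, and the guess that it appends "not supported in Arrow" to the message, are not confirmed by this page. `check_length()` is a hypothetical caller.

```r
# Hypothetical base-R versions of the two helpers named in this thread.
arrow_not_supported <- function(msg, body = NULL, call = NULL) {
  stop(errorCondition(
    paste(msg, "not supported in Arrow"),  # assumed suffix
    body = body,                           # extra field holding suggestions
    class = "arrow_not_supported",
    call = call
  ))
}

validation_error <- function(msg, call = NULL) {
  stop(errorCondition(msg, class = "validation_error", call = call))
}

# A check made outside expression evaluation, as in the diff excerpt above.
check_length <- function(value, expr) {
  if (length(value) != 1) {
    arrow_not_supported("Recycling values of length != 1", call = expr)
  }
  invisible(value)
}

err <- tryCatch(check_length(1:3, quote(x + 1:3)), error = function(e) e)
inherits(err, "arrow_not_supported")  # TRUE
conditionCall(err)                    # x + 1:3
```

Passing the user's expression as `call` means the printed error points at that expression instead of at an internal helper.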
Mostly. I just went back and commented on the PR in a couple places that show some of the non-indentation changes.
I should probably go and add some sentences to the …
Do we definitely still want/need that article? I am a big +1 to removing redundant docs/code, and given that it's buried in the developer docs and it's not likely there'll be a ton of new Acero functions, we could, like, just delete it?
Always a fan of UX changes like this, and I love the usage of → to suggest a concrete action. Out of curiosity, is this something we're emulating from somewhere else or something you came up with on this PR @nealrichardson?
I'm cool with deleting it. You're right that it's from another era in the package's development. And if someone is going to add more bindings, there's hundreds of examples to copy now.
I guess I came up with it. Looking at the options in cli (https://cli.r-lib.org/reference/cli_bullets.html), I wanted to reserve the …
### Rationale for this change

Missed this in #41576

### Are these changes tested?

We should make sure.

### Are there any user-facing changes?

No.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 774ee0f. There were 8 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.
Necessary for a clean check. These were inadvertently taken out in #41576 and don't actually change any code; they just appease the static checker that CRAN runs.
Authored-by: Jonathan Keane
Signed-off-by: Jonathan Keane
### Rationale for this change

The writing-bindings vignette was removed in #41576 (comment). It turns out there were more references to it throughout the docs that I failed to remove.

### What changes are included in this PR?

Deleting x-refs that don't exist anymore.

### Are these changes tested?

Not really

### Are there any user-facing changes?

The docs won't point you at links that 404.

* GitHub Issue: #43665
I started out trying to make it so that `arrow_eval()` could just raise its errors, rather than catch them and have every caller inspect and re-raise. I kept pulling on this thread and ended up refactoring most of the error handling in the dplyr code paths.

Summary of changes, from the bottom up:

* `arrow_not_supported()` (which previously existed but just called `stop()`) and `validation_error()`. They raise `arrow_not_supported` and `validation_error`, respectively. Function bindings now raise one or the other, never just stop/abort.
* `arrow_eval()` modifies the errors raised by function bindings, inserting the expression as the `call` attribute of the error, which lets `rlang` handle the printing more cleanly, and catching any non-classed errors and re-raising them as `arrow_not_supported` or `validation_error`, as appropriate.
* `try_arrow_dplyr()` wrapper around everything inside (most*) dplyr verb implementations, which only calls `abandon_ship()` on `arrow_not_supported` errors, and lets all other errors just raise. For datasets, it just adds an additional note to the error message advising you that you can call `collect()`.

So errors generally bubble up, and each of these wrappers adds some context to the message.

The ultimate results of all of this:

* … `collect()` (or, if on in-memory data, just do it) in cases where it would also fail in regular dplyr because the input is invalid.
* … `Error: Error :` messages.
* … `collect()`. In fact, if there are suggestions with the ">" (arrow) bullet, we don't just add "Call collect()", we say "Or, call collect()".
* … `arrow_eval()` and the dplyr verbs in general. There's less bookkeeping you have to do to catch and rethrow errors, and it's consistent across the various parts of the evaluation (i.e. the same thing works inside the dplyr verbs as in the bindings).

Some concrete examples:

* … `summarize()` but not caught inside `arrow_eval()` because it's not about the expressions.
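
The middle layer in the summary above (attach the expression as the error's call, and give unclassed errors a class) can be sketched in base R as follows. This is an illustration under stated assumptions, not arrow's implementation: `classed_error`, `bindings`, `twice`, `legacy`, and `eval_sketch` are hypothetical names, and because the description does not spell out how the real code chooses between the two classes for unclassed errors, this sketch simply defaults to `arrow_not_supported`.

```r
# Hypothetical sketch; names are illustrative, not arrow's code.
classed_error <- function(msg, class, call = NULL) {
  errorCondition(msg, class = class, call = call)
}

# Toy "bindings": one raises a classed error, one raises a plain error.
bindings <- list(
  twice = function(x) {
    if (!is.numeric(x)) {
      stop(classed_error("`x` must be numeric", "validation_error"))
    }
    x * 2
  },
  legacy = function(x) stop("something unexpected")
)

# Evaluate `expr` against the bindings. On error, re-raise with the
# expression attached as the call; keep a known class, otherwise fall back
# to arrow_not_supported (an assumption made for this sketch).
eval_sketch <- function(expr) {
  tryCatch(
    eval(expr, bindings),
    error = function(e) {
      known <- inherits(e, c("arrow_not_supported", "validation_error"))
      cls <- if (known) class(e)[1] else "arrow_not_supported"
      stop(classed_error(conditionMessage(e), cls, call = expr))
    }
  )
}

e1 <- tryCatch(eval_sketch(quote(twice("a"))), error = function(e) e)
e2 <- tryCatch(eval_sketch(quote(legacy(1))), error = function(e) e)
class(e1)[1]       # "validation_error"
class(e2)[1]       # "arrow_not_supported"
conditionCall(e2)  # legacy(1)
```

A caller further up can then handle `arrow_not_supported` alone and let `validation_error` propagate, which is the split the description attributes to `try_arrow_dplyr()`.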